MSc. in biology, currently working at SYKE, ca. 5 years of experience. Main interests: sustainability at macro scale, national EE-IO analysis, food system sustainability modeling, circular economy.
In my projects at SYKE we use R in various capacities on a daily basis. I expect to expand my R skills and perhaps find new tools/packages to use in my work. That should be easier with some structure to the learning.
Our team also just adopted Git, so it's nice to get a bit deeper into that as well.
Additionally, the statistical component will most likely benefit me. I wish to expand my repertoire of analytical capabilities overall. At this point I'm not looking to learn any single particular analysis type (since it's hard to tell what will be useful in the coming years), so it's good to have a solid reference base.
Initial impressions suggest this would have been good learning material when I first started dealing with R. The basics are made simple to understand, but the scope seems to go quite a bit deeper than that. There seems to be plenty to learn, and I'm tempted to treat this as reference material in the future too (or as a tutorial for colleagues, if I'm allowed to share it outside the course?).
## [1] "Chapter complete on: Wed Dec 6 16:01:33 2023"
Describe the work you have done this week and summarize your learning.
if (!require(here)) {
  install.packages("here") # require() returns FALSE if the package is not installed; install it in that case
}
library(here)
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)
if (!require(GGally)) {
  install.packages("GGally")
}
library(GGally)
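The three near-identical checks above could be collapsed into a small helper. A sketch (the name `ensure_packages` is my own, not from the course material):

```r
# Hypothetical helper: install any missing packages, then load them all.
ensure_packages <- function(pkgs) {
  for (p in pkgs) {
    # character.only = TRUE lets require()/library() accept the name as a string
    if (!require(p, character.only = TRUE, quietly = TRUE)) {
      install.packages(p)
    }
    library(p, character.only = TRUE)
  }
}
ensure_packages(c("here", "tidyverse", "GGally"))
```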
ggpairs(Data, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
x<-lm(Points ~ deep+Attitude+Age, data = Data) #Create model, assign a name "x" to it
summary(x)
##
## Call:
## lm(formula = Points ~ deep + Attitude + Age, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## deep -0.60275 0.75031 -0.803 0.423
## Attitude 0.35941 0.05696 6.310 2.56e-09 ***
## Age -0.07716 0.05322 -1.450 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
z<-lm(Points ~ Attitude, data = Data) #Create model, assign the name "z" to it
summary(z)
##
## Call:
## lm(formula = Points ~ Attitude, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
plot(z, which=c(1,2,5))
source(here("Scripts/create_data_assignment3.R"))
taulukko<-alc %>% group_by(high_use) %>% summarise(`Average absence` = mean(absences),
`St.Dev, absences.`=sd(absences),
`Average health` = mean(health),
`St.Dev., health`=sd(health),
`Average failure` = mean(failures),
`St.Dev., failures`=sd(failures),
`Average family relations` = mean(famrel),
`St.Dev., family relations` = sd(famrel))
library(gt) # gt provides the table-formatting functions below
gt(taulukko) %>%
  fmt_number(decimals = 2) %>% cols_align("center") %>% opt_stylize(style = 1, color = "green") %>% tab_header("Summary table")
Summary table

| high_use | Average absence | St.Dev, absences. | Average health | St.Dev., health | Average failure | St.Dev., failures | Average family relations | St.Dev., family relations |
|---|---|---|---|---|---|---|---|---|
| FALSE | 3.71 | 4.46 | 3.49 | 1.41 | 0.12 | 0.44 | 4.01 | 0.89 |
| TRUE | 6.38 | 7.06 | 3.73 | 1.40 | 0.35 | 0.75 | 3.77 | 0.93 |
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Distribution of `absences` values by group")
ggplot(alc, aes(x = high_use, y = health)) + geom_boxplot()+ggtitle("Distribution of `health` values by group")
Frequency of `failures` values by group

| failures | n |
|---|---|
| **FALSE** | |
| 0 | 238 |
| 1 | 12 |
| 2 | 8 |
| 3 | 1 |
| **TRUE** | |
| 0 | 87 |
| 1 | 12 |
| 2 | 9 |
| 3 | 3 |
# find the model with glm()
m <- glm(high_use ~ failures+absences+health+famrel, data = alc, family = "binomial")
# create model summary and extract the coeffs
summary(m)
coeffs<-m %>% summary() %>% coefficients()
OddsRatio<-exp(coeffs)
ConfInt<- confint(m)
Result<-cbind(OddsRatio, ConfInt)
print(Result)
## Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## (Intercept) 0.4634471 1.790547 0.2670735 1.205335 -1.93186722 0.36139875
## failures 1.8054371 1.219527 19.6267894 1.002916 0.20659753 0.99228511
## absences 1.0848599 1.022825 36.9319701 1.000307 0.03933340 0.12816947
## health 1.1387751 1.092594 4.3383139 1.152858 -0.04154535 0.30645324
## famrel 0.7598363 1.137448 0.1185288 1.033507 -0.52884754 -0.02203822
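Note that `exp(coeffs)` above exponentiates every column of the summary matrix, including the standard errors and z values, which are only interpretable on the log-odds scale, while the confidence limits in `ConfInt` stay unexponentiated. A sketch of exponentiating only the estimates and the limits, using the built-in `mtcars` data as a stand-in (since `alc` comes from the course script; the formula is illustrative only):

```r
# Stand-in logistic model on built-in data
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# Exponentiate only the point estimates and the Wald confidence limits,
# so both columns end up on the odds-ratio scale:
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(or_table)
```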
m <- glm(high_use ~ failures+absences+famrel, data = alc, family = "binomial")
Here, we take the model from before and, based on it, calculate the probability that a given student (row in the data) has the attribute "heavy drinker".
Then, we use the probabilities to make a prediction of high_use.
The model takes actual values from the data and, based on them, estimates how likely it is that a given row (student) has the attribute (heavy drinker). These predictions may or may not match the actual value on each row.
alc$predicted_probabilities <- predict(m, type = "response") #type determines which outcome is given, see ?predict
alc <- mutate(alc,prediction = predicted_probabilities > 0.5) #Selects those, for which the model indicates should have more than 50% prob of being a heavy drinker. T if over 50, F if not.
x <- table(high_use = alc$high_use, prediction = alc$prediction) # cross-tabulate actual values vs. predictions
print(x)
## prediction
## high_use FALSE TRUE
## FALSE 242 17
## TRUE 92 19
y <- 100 * prop.table(table(high_use = alc$high_use, prediction = alc$prediction)) # the same table as percentages
print(y)
## prediction
## high_use FALSE TRUE
## FALSE 65.405405 4.594595
## TRUE 24.864865 5.135135
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
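A quick sanity check of the loss function on hand-made vectors (my own toy data, not from the course):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
cls  <- c(0, 0, 1, 1)          # true classes
prob <- c(0.1, 0.8, 0.9, 0.3)  # predicted probabilities; the 2nd and 4th are off
loss_func(cls, prob)           # 0.5: half of the cases misclassified
```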
Task: find the proportion of "light drinkers" that the model misclassifies as "heavies".
Set the probability of being a heavy drinker to 0%.
The function takes the vector of actual classes and a probability. Say it finds a student with a class of 0. It subtracts the probability (0) from the class value: 0 - 0 = 0, which is not above 0.5 (FALSE). This is a correct classification, as we just set the probability of being a heavy drinker to 0.
If the result were a positive number, it would have found a heavy drinker (1 - 0 = 1, which is > 0.5, TRUE) despite the probability being 0, i.e. a misclassified case.
The converse case: we want to find cases in which the class is 0 even though we set the probability of being a heavy drinker to 100%. If the probability is set to 1 (100% likelihood of being a heavy drinker) and the function finds a class of 1, it computes abs(1 - 1) = 0, which is not > 0.5: a correctly classified case.
If it finds a misclassified one, the result is abs(0 - 1) = 1, which as an absolute number (drop the minus) is higher than 0.5.
In each scenario, the function then returns the mean of these misclassification indicators.
Here, we set the probability to 0, so every actual heavy drinker (class 1) counts as a misclassification:
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
The opposite case (class 0 found when the probability was set to 1) gives 0.7.
Fixing the probability at 1 thus misclassifies more than twice as many students (the 70% who are not heavy drinkers) as fixing it at 0 does (the 30% who are).
# K-fold cross-validation
library(boot)
## Warning: package 'boot' was built under R version 4.3.2
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3
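The same cross-validation call can be reproduced end-to-end on built-in data. A sketch with `mtcars` standing in for `alc` (the formula is chosen for illustration only):

```r
library(boot)
# mean prediction error, as above
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
set.seed(1)  # make the fold assignment reproducible
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
cv  <- cv.glm(data = mtcars, cost = loss_func, glmfit = fit, K = 10)
cv$delta[1]  # average misclassification rate over the folds
```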
Further delving into the bonus section if time allows…
Not included, as this time it is listed as the final assignment and deals with next week's data. This has been completed; "create_human.R" is available in the repo under Scripts.
require(tidyverse);require(here);require(MASS);require(corrplot); require(GGally)
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Overall the dimensions are 506 x 14.
All of the 14 variables appear numerical (integer or double).
Next, we look up what each of the 14 vars represents.
The dataset has an official documentation.
The data contains information on housing in the Boston region, USA.
Included variables are:
Next, graphical and table format summaries are generated for the data
First, summary as a table
summary(Boston)
ggpairs(Boston)
There’s quite a lot of options on what to look at here. I’m going to cherry pick some findings, instead of going through every variable.
Multiple variables have skewed distributions. For example:
Also, several variables show bimodality (= more than one peak in the distribution, meaning that two value ranges are more common than the rest)
Scatterplot Indus x NOx appears to show 2 distinct groupings? For most data points, both NOX concentrations and the share of industrial acreage are low (these have a strong, statistically significant correlation coeff too!)
Crime rate appears to have a statistically significant correlation with almost all of these vars. Seems to correlate positively with the proportion of industrial acreage, NOX concentrations, large residential land areas…
The distribution of “indus” (business acreage) shows bimodality: we have two peaks, indicating that a couple of values are considerably more common than others. This variable also correlates w. high statistical significance with NOX emissions, which makes sense as the variable represents the prevalence of business acreage like industry.
Distribution of NOX is strongly skewed towards small values.
Age skews strongly towards high values. Overall, most construction in the regions of the data was done prior to 1940.
Again, we see bimodality in property tax rates. Low and high ends of the spectrum have clear peaks.
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
summary(Boston)
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crim<-NULL
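The quantile-binning pattern used for `crim` above, shown on a small made-up vector:

```r
x <- c(1, 2, 3, 4, 10, 20, 30, 40)
bins <- quantile(x)  # min, Q1, median, Q3, max
grp  <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(grp)           # two observations fall into each quartile bin
```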
*Dividing the data into training and testing sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = nrow(boston_scaled) * 0.8)
# create train set and test set
train <- boston_scaled[ind,] #Randomly selects 80% of rows, index numbers
test <- boston_scaled[-ind,] #everything except the indices in the training set
lda.fit <- lda(crime ~ . , data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2)
correct_classes<-test$crime
test$crime<-NULL
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
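From any such cross-tabulation, the overall accuracy is the share of the diagonal. A generic sketch on a toy table (my own example values):

```r
# Toy confusion table standing in for the crime-class tabulation above
tab <- table(correct   = c("low", "low", "high", "high", "high"),
             predicted = c("low", "high", "high", "high", "low"))
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 0.6: three of the five toy cases sit on the diagonal
```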
*Clearing the environment, reloading the raw Boston data and scaling it
data(Boston)
Boston_scaled<-as.data.frame(scale(Boston))
*Calculating distances between data points
dist_eu <- dist(Boston_scaled)
Run the K-means clustering algorithm on the scaled data. Here, we run it with 6 centers; optimization of K follows below.
# k-means clustering
km <- kmeans(Boston_scaled, centers = "6")
pairs(Boston_scaled[1:6], col=km$cluster)
*Quite a confusing table…K of 6 is arguably too many.
*Optimizing K
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})
# visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line() # qplot() is deprecated since ggplot2 3.4.0
km <- kmeans(Boston_scaled, centers = 2)
pairs(Boston_scaled[1:5], col = km$cluster)
pairs(Boston_scaled[6:10], col = km$cluster)
pairs(Boston_scaled[c(1,10)], col = km$cluster)
pairs(Boston_scaled[c(3,5)], col = km$cluster)
pairs(Boston_scaled[c(7,14)], col = km$cluster)
pairs(Boston_scaled[c(1,14)], col = km$cluster)
data("Boston")
Boston_scaled<-as.data.frame(scale(Boston))
km <- kmeans(Boston_scaled, centers = "5")
lda.fit <- lda(km$cluster ~ . , data = Boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2.4)
#More zoom
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =20)
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
*Calculate matrix products
lda.fit <- lda(crime ~ . , data = train)
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
dim(lda.fit$scaling)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
3D Plot
Colored by the crime classes of the training data, we get the pink “High crime” data points grouped on their own (some overlap with med-high, though)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
require("here");require("tidyverse")
#source(here("Scripts/create_human.R"))
rm(list=c("gii", "hd"))
## Warning in rm(list = c("gii", "hd")): object 'gii' not found
## Warning in rm(list = c("gii", "hd")): object 'hd' not found
Task 1: Explore the structure and the dimensions of the ‘human’ data and describe the dataset briefly, assuming the reader has no previous knowledge of it (this is now close to the reality, since you have named the variables yourself). (1 point)
library(readr)
human <- read_csv("Data/human.csv")
## New names:
## Rows: 195 Columns: 20
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (19): ...1, HDI_Rank, HDI, LifeExp, EduExp, EduMean, GNI,
## GNI_minus_HDIr...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
human<-select(human, 2:length(human)) # drop the row-index column (`...1`) written by write.csv
str(human)
dim(human)
Task 2: Exclude unneeded variables: keep only the columns matching the following variable names (described in the meta file above): “Country”, “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, “Parli.F” (1 point).
# Create a vector of vars to keep. Remember that you renamed your vars differently, copypasting the above names into a vector won't work:
keepers<-c("Country",
"PSECED_FM_Ratio",
"LaborFM_ratio",
"EduExp",
"LifeExp",
"GNI",
"MMRatio",
"ABRate",
"PRP")
# Select only these vars from human dataframe
human<-human %>% select(all_of(keepers))
Task 3: Remove all rows with missing values (1 point).
human<-filter(human, complete.cases(human)==T)
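`complete.cases()` flags rows with no missing values; a minimal illustration on made-up data:

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)        # TRUE FALSE FALSE
df[complete.cases(df), ]  # keeps only the first row
```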
Task 4: Remove the observations which relate to regions instead of countries. (1 point)
human<-human %>% slice(1:(nrow(human)-7)) # the last 7 rows are region aggregates, not countries
dim(human)
Task 5 The data should now have 155 observations and 9 variables (including the “Country” variable). Save the human data in your data folder. You can overwrite your old ‘human’ data. (1 point).
dim(human)
library(here)
write.csv(human, file=here("Data/human2.csv"))
Task 1: Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(tibble)
human_<-column_to_rownames(human, "Country")
Commentary:
library(GGally)
ggpairs(human_, progress = F)
summary(human_)
## PSECED_FM_Ratio LaborFM_ratio EduExp LifeExp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MMRatio ABRate PRP
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
head(arrange(human_, LifeExp, GNI), n= 20)
## PSECED_FM_Ratio LaborFM_ratio EduExp LifeExp
## Swaziland 0.8423077 0.6131285 11.3 49.0
## Lesotho 1.1526316 0.8027211 11.1 49.8
## Central African Republic 0.3782772 0.8531140 7.2 50.7
## Sierra Leone 0.4608295 0.9521739 8.6 50.9
## Côte d'Ivoire 0.4651163 0.6437346 8.9 51.5
## Chad 0.1717172 0.8080808 7.4 51.6
## Mozambique 0.2258065 1.0326087 9.3 55.1
## Cameroon 0.6103152 0.8307292 10.4 55.5
## Burundi 0.6385542 1.0158537 10.1 56.7
## South Africa 0.9578393 0.7355372 13.6 57.4
## Zimbabwe 0.7854839 0.9275362 10.9 57.5
## Mali 0.5099338 0.6240786 8.4 58.0
## Uganda 0.6835821 0.9570707 9.8 58.5
## Congo (Democratic Republic of the) 0.3950617 0.9658470 9.8 58.7
## Burkina Faso 0.2812500 0.8566667 7.8 58.7
## Benin 0.4185185 0.8633461 11.1 59.6
## Togo 0.3995037 0.9913899 12.2 59.7
## Zambia 0.5863636 0.8539720 13.5 60.1
## Gambia 0.5523810 0.8709288 8.8 60.2
## Afghanistan 0.1979866 0.1987421 9.3 60.4
## GNI MMRatio ABRate PRP
## Swaziland 5542 310 72.0 14.7
## Lesotho 3306 490 89.4 26.8
## Central African Republic 581 880 98.3 12.5
## Sierra Leone 1780 1100 100.7 12.4
## Côte d'Ivoire 3171 720 130.3 9.2
## Chad 2085 980 152.0 14.9
## Mozambique 1123 480 137.8 39.6
## Cameroon 2803 590 115.8 27.1
## Burundi 758 740 30.3 34.9
## South Africa 12122 140 50.9 40.7
## Zimbabwe 1615 470 60.3 35.1
## Mali 1583 550 175.6 9.5
## Uganda 1613 360 126.6 35.0
## Congo (Democratic Republic of the) 680 730 135.3 8.2
## Burkina Faso 1591 400 115.4 13.3
## Benin 1767 340 90.2 8.4
## Togo 1228 450 91.5 17.6
## Zambia 3734 280 125.4 12.7
## Gambia 1507 430 115.8 9.4
## Afghanistan 1885 400 86.8 27.6
Task 2: Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
pca_human_nonstandard <- prcomp(human_)
summary(pca_human_nonstandard)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human_nonstandard, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Some interpretation (this is asked for in the next task):
As the data is not standardized, every variable is measured on a different scale. Only GNI is highlighted as a contributing variable, since its range is massively wider than that of any other variable and thus dominates the analysis.
Basically, the data is now reduced to "very wealthy countries" versus "everybody else".
GNI is negatively associated with the 1st PC; its vector points towards small PC1 values (see also the "loadings" table below, which shows essentially the same thing).
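The scale effect can be demonstrated on made-up data: a variable with a thousandfold larger spread absorbs essentially all of the unscaled variance.

```r
set.seed(42)
toy <- data.frame(small = rnorm(100),             # sd around 1
                  big   = rnorm(100, sd = 1000))  # sd around 1000
p_raw <- prcomp(toy)                 # unscaled: 'big' dominates PC1
p_std <- prcomp(toy, scale. = TRUE)  # scaled: variance splits roughly evenly
summary(p_raw)$importance["Proportion of Variance", "PC1"]  # close to 1
summary(p_std)$importance["Proportion of Variance", "PC1"]  # close to 0.5
```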
library(loadings)
pca_loading(pca_human_nonstandard)
## Warning in sqrt(1 - PC_loading^2): NaNs produced
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## PSECED_FM_Ratio -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## LaborFM_ratio 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## EduExp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## LifeExp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## MMRatio 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## ABRate 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## PRP -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## PSECED_FM_Ratio -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## LaborFM_ratio 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## EduExp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## LifeExp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## MMRatio 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## ABRate 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## PRP -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
Task 3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
human__scaled<-scale(human_)
summary(human__scaled)
## PSECED_FM_Ratio LaborFM_ratio EduExp LifeExp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI MMRatio ABRate PRP
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
human__scaled<-scale(human_)
pca_human_standard <- prcomp(human__scaled)
summary(pca_human_standard)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
s<-summary(pca_human_standard)
Pros<-round(100*s$importance[2, ], digits = 2)
Pros<-paste0(names(Pros), " (", Pros, "%)")
biplot(pca_human_standard, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab= Pros[1], ylab=Pros[2])
pca_loading(pca_human_standard)
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## PSECED_FM_Ratio -0.35664370 0.03796058 -0.24223089 -0.62678110 0.5983585
## LaborFM_ratio 0.05457785 0.72432726 -0.58428770 -0.06199424 -0.2625067
## EduExp -0.42766720 0.13940571 -0.07340270 0.07020294 -0.1659678
## LifeExp -0.44372240 -0.02530473 0.10991305 0.05834819 -0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 0.72727675 0.4950306
## MMRatio 0.43697098 0.14508727 -0.12522539 0.25170614 0.1800657
## ABRate 0.41126010 0.07708468 0.01968243 -0.04986763 0.4672068
## PRP -0.08438558 0.65136866 0.72506309 -0.01396293 0.1523699
## PC6 PC7 PC8
## PSECED_FM_Ratio 0.17713316 0.05773644 0.16459453
## LaborFM_ratio -0.03500707 -0.22729927 -0.07304568
## EduExp -0.38606919 0.77962966 -0.05415984
## LifeExp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## MMRatio 0.17370039 0.35380306 0.72193946
## ABRate -0.76056557 -0.06897064 -0.14335186
## PRP 0.13749772 0.00568387 -0.02306476
Task 5: The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
view(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The data consist mostly of two-level factor (categorical) variables describing tea-drinking habits; the dimensions are 300 observations x 36 variables.
Let's simplify a bit and select a subset of the variables for the analysis.
Visual summary
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset;
# wrapping the external vector in all_of() avoids the tidyselect deprecation warning
tea_time <- select(tea, all_of(keep_columns))
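tidyselect offers two helpers for selecting columns with an external character vector; a quick sketch of the difference, reusing `tea` and `keep_columns` from above (`nonexistent_column` is a made-up name for illustration):

```r
# all_of() is strict: it errors if any name in the vector is missing
tea_strict <- select(tea, all_of(keep_columns))

# any_of() is permissive: missing names are silently skipped
tea_loose <- select(tea, any_of(c(keep_columns, "nonexistent_column")))
```

The strict version is usually the safer default, since a typo in the name vector fails loudly instead of quietly dropping a column.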
# visual summary
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
  ggplot(aes(value)) +
  facet_wrap("name", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Multiple Correspondence Analysis
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
# plot the variable categories only; individuals are hidden for readability
plot(mca, invisible = c("ind"), graph.type = "classic")
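To gauge how much of the variation the first MCA dimensions actually capture, the fitted object can be inspected directly. A short sketch using the `mca` object created above (as I understand FactoMineR's output, `$eig` holds the eigenvalue, percentage of variance, and cumulative percentage per dimension):

```r
# inertia (variance) retained by each MCA dimension
head(mca$eig)

# coordinates of the variable categories on the first two dimensions
head(mca$var$coord[, 1:2])
```

This is a useful sanity check before interpreting the biplot, since with many binary indicators the leading dimensions often retain only a modest share of the total inertia.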
library(here)
library(tidyverse)
# read the long-format rats data; here() builds the path from the project root
RATS <- read.csv(here("Data/rats_lng.csv"))
RATS$X <- NULL                      # drop the row-index column left by write.csv
RATS$ID <- as.factor(RATS$ID)       # rat identifier
RATS$Group <- as.factor(RATS$Group) # diet group
library(ggplot2)
ggplot(RATS, aes(x = Time, y = Weight , linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight)))
# standardise weight at each measurement time, so the profiles show
# tracking, i.e. each rat's relative position within the time point
RATS <- RATS %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
library(ggplot2)
ggplot(RATS, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
library(plotrix)
## Warning: package 'plotrix' was built under R version 4.3.2
# Summary data with mean and standard error of weight by group and time
# (assigned to a new object so the long-format RATS data is not overwritten)
RATSS <- RATS %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = std.error(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1, 2, 3)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 2, 3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype = "1"), width = 0.3) +
  theme(legend.position = c(0.95, 0.45)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
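plotrix is pulled in here only for `std.error()`. If the extra dependency feels unnecessary, the standard error of the mean is simple to compute by hand; a minimal sketch (`se` is my own helper name, not part of any package):

```r
# standard error of the mean: sd / sqrt(n), ignoring missing values
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}
```

Dropping this into the `summarise()` call in place of `std.error()` should give the same error bars without loading plotrix.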